Predictive Discriminant Analysis Versus Logistic Regression in 
Two-Group Classification Problems 

ABSTRACT. A method for comparing the cross-validated classification 
accuracies of predictive discriminant analysis and logistic regression 
classification models is presented under varying data conditions for the 
two-group classification problem. With this method, separate-group as 
well as total-sample proportions of correct classifications can be 
compared for the two models. McNemar’s test for contrasting 
correlated proportions is used in the statistical comparisons of the 
separate-group and total-sample proportions. The method is illustrated 
with 32 real data sets. 

Among the methods used for solving two-group classification problems, 
logistic regression (LR) and predictive discriminant analysis (PDA) are two of the 
most popular (Yarnold, Hart, & Soltysik, 1994, p. 73). Unlike PDA, LR captures 
the probabilistic distribution embedded in dichotomous measures, avoids violations of 
the assumption of homogeneity of variance, and does not require strict multivariate 
normality (e.g., see Aldrich & Nelson, 1984; Cox & Snell, 1989). Therefore, when 
PDA assumptions are violated, we might theoretically expect higher cross-validated 
classification accuracy with LR than with PDA. 

Although several studies have compared the classification accuracy of LR and 
PDA, the results have been inconsistent. For example, results of three simulation 
studies (Baron, 1991; Bayne, Beauchamp, Kane, & McCabe, 1983; Crawley, 1979) 
suggest that LR is more accurate than PDA for nonnormal data. However, several 
researchers (e.g., Cleary & Angel, 1984; Dey & Astin, 1993; Knoke, 1982; 
Krzanowski, 1975; Press & Wilson, 1978) using nonnormal real data found little or 
no difference in the accuracy of the two techniques. Findings are also inconsistent for 
degree of group separation. Bayne et al. (1983) found that larger group separation 
favored PDA, while Crawley (1979) found this condition to favor LR. Sample size is 
yet another data condition yielding inconsistent results. In a simulation study, Harrell 
and Lee (1985) found that PDA was more accurate than LR for small samples. By 
contrast, in a study by Johnston and Seshia (1992) using real data, LR worked better 
than PDA for small samples. 

Given these inconsistencies in the literature, it is not clear which of the two 
methods will work better for a given data set. Consequently, a method for comparing 
the accuracy of PDA and LR for a specific data set will enable researchers to select 
the optimal classification procedure for that data set. In this paper we describe a 
method for determining the superior classification model for a specific data set, 
regardless of data conditions. In addition, a computer program that accomplishes the 
method is introduced and demonstrated. 

Method 

Data Sources. We used 32 classification data sets varying in number of cases, 
relative group sizes, number of predictor variables, degree of group separation, and 
equality of group covariance matrices to illustrate the method. To bolster validity, all 
were taken from real classification studies. The sources include journal articles, 
paper presentations and research texts. 

Procedure. For PDA, we built linear classification functions based on assumptions of 
multivariate normality, equal covariance matrices, and equal prior probabilities of 
group membership. We classified cases into groups using Tatsuoka’s (1988) 
minimum chi-square rule. For LR, we used the IMSL subroutine CTGLM, 
available with the 32-bit Microsoft Fortran PowerStation v4.0, 
to obtain model coefficients. The CTGLM routine uses a standard 
nonlinear approximation technique (Newton-Raphson) to determine maximum 
likelihood estimates of model coefficients. We classified each case into the group 
with the highest log-likelihood probability. 
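As a rough illustration of the Newton-Raphson idea described above, the following Python sketch fits a logistic model with a single predictor. It is not the CTGLM routine; the function name `fit_logistic_newton`, the restriction to one predictor, and the convergence settings are assumptions made only to keep the example short.

```python
import math

def fit_logistic_newton(x, y, max_iter=25, tol=1e-8):
    """Maximum likelihood fit of P(y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))).

    Hypothetical helper: one predictor only, so the 2x2 Newton system
    can be solved in closed form without a linear algebra library.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Accumulate the score vector (g) and information matrix (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            raise ValueError("information matrix is singular; estimates may not exist")
        # Newton step: solve h * delta = g for the 2x2 system.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) < tol and abs(d1) < tol:
            return b0, b1
    raise RuntimeError("Newton-Raphson did not converge")

# Illustrative data (made up): group membership overlaps on x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(x, y)
```

A case would then be assigned to group 1 when its fitted probability exceeds .5, that is, when b0 + b1*x is positive. When the groups are completely separated on the predictor, the likelihood has no finite maximum, so the sketch raises an error instead of returning estimates.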

In comparing the predictive accuracy of the PDA model to that of the LR 
model, we considered external rather than internal results. Results of an internal 
classification analysis are those obtained when measures for the individuals on whom 
the statistics were based are resubstituted to obtain the predicted classification scores. 
In an external classification analysis statistics based on one set of individuals are used 
in classifying new individuals. An external analysis is appropriate for making 
inferences about the discriminatory power of the predictors for a new set of data 
(Huberty, 1994). 

We estimated external, or cross-validated, hit-rate accuracy using the 
leave-one-out procedure. Each case is classified by applying the model derived from 
all cases except the one being classified. This process was repeated for every case, 
and the resulting count of correct classifications was used to estimate the cross-validated 
accuracy. This procedure has a relatively wide following in the discriminant analysis 
literature (see, for example, Huberty, 1994; Huberty & Mourad, 1980; Lachenbruch, 
1967; Mosteller & Tukey, 1968). 
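The leave-one-out loop itself can be sketched in a few lines of Python. The classification rule used here, `nearest_mean_rule`, is a deliberately simple stand-in (one predictor, assign to the group with the closer mean) and is not the PDA or LR model of the study; all function names and the data are hypothetical.

```python
def nearest_mean_rule(train_x, train_y):
    """Stand-in classifier: assign a case to the group whose mean is closer.

    Ties go to group 0. Any function that returns a classifier could be
    substituted for this one.
    """
    means = {}
    for g in (0, 1):
        values = [x for x, label in zip(train_x, train_y) if label == g]
        means[g] = sum(values) / len(values)
    return lambda x: 0 if abs(x - means[0]) <= abs(x - means[1]) else 1

def leave_one_out_hits(xs, ys, fit=nearest_mean_rule):
    """Return one boolean per case: was it classified correctly when held out?"""
    hits = []
    for i in range(len(xs)):
        # Refit the rule on every case except case i, then classify case i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        classify = fit(train_x, train_y)
        hits.append(classify(xs[i]) == ys[i])
    return hits

def hit_rates(hits, ys):
    """Separate-group and total-sample proportions of correct classification."""
    rates = {}
    for g in (0, 1):
        group_hits = [h for h, label in zip(hits, ys) if label == g]
        rates[g] = sum(group_hits) / len(group_hits)
    rates["total"] = sum(hits) / len(hits)
    return rates

# Illustrative data (made up): four cases per group, one predictor.
xs = [1.0, 2.0, 3.0, 6.0, 4.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
hits = leave_one_out_hits(xs, ys)
print(hit_rates(hits, ys))
```

Keeping the per-case hit and miss indicators, rather than only the hit-rate totals, matters for the later comparison: the two models are evaluated on the same cases, so their results are paired.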

We compared separate-group as well as total-sample proportions of correct 
classification for the PDA and LR models. We used McNemar’s (1947) test for 
contrasting correlated proportions in the statistical comparisons between PDA and LR 
models for the separate-group and total-sample proportions. This method was 
previously suggested for comparing full and reduced classification methods (Morris & 
Huberty, 1995; Morris & Meshbane, 1995) as well as for comparing linear and 
quadratic classification models (Meshbane & Morris, 1995), but is equally applicable 
in comparing PDA and LR models. (See Looney, 1988, for a method of comparing 
classification results of more than two models.) Because the calculation of the 
McNemar correlated proportion statistic requires the joint distribution of hits and 
misses for both the PDA and LR models, no statistical package will accomplish the 
method. Therefore, we wrote a FORTRAN computer program to provide this 
information. 
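To make the role of the joint distribution concrete, here is a minimal Python sketch of a McNemar-type z statistic computed from paired hit and miss indicators. The source does not print the formula it used, so the uncorrected form z = (b - c) / sqrt(b + c), where b and c count the cases that only one of the two models classified correctly, is an assumption; the function name `mcnemar_z` is hypothetical.

```python
import math

def mcnemar_z(hits_a, hits_b):
    """z statistic contrasting two correlated proportions of correct classification.

    hits_a and hits_b are per-case booleans for models A and B on the
    same cases. Only discordant cases (one model right, the other wrong)
    contribute. No continuity correction is applied (an assumption).
    """
    a_only = sum(1 for a, b in zip(hits_a, hits_b) if a and not b)
    b_only = sum(1 for a, b in zip(hits_a, hits_b) if b and not a)
    discordant = a_only + b_only
    if discordant == 0:
        return 0.0  # the two models agree on every case
    return (a_only - b_only) / math.sqrt(discordant)

# Illustrative indicators (made up): 20 cases both models hit, 1 only A hit,
# 9 only B hit, 5 both missed.
hits_a = [True] * 20 + [True] * 1 + [False] * 9 + [False] * 5
hits_b = [True] * 20 + [False] * 1 + [True] * 9 + [False] * 5
z = mcnemar_z(hits_a, hits_b)
```

A positive value favors model A and a negative value favors model B. For a separate-group comparison, the same function would be applied to the indicators of the cases belonging to that group only.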

We used the Box test for testing the assumption of homogeneity of covariance 
structures. This test is sensitive to departures from multivariate normality, and the outcome is 
therefore confounded with the homogeneity of dispersion issue. Nevertheless, the 
Box test is routinely used for testing the homogeneity of dispersion assumption and is 
even the default in some statistical packages. Notwithstanding concerns over this test, 
one could argue that, theoretically, a logistic classification model is more likely to be 
appropriate when the Box test indicates that the covariance structures are unequal. 

Results and Discussion 



For each of the data sets, Table 1 gives a short description, the degree of 
group separation (D), the number of cases in group 1 (n_1), the number of cases in 
group 2 (n_2), an index of disproportionality of the group sizes (I) calculated as 
(n_i * 100) / n_j, where n_i is the larger of the two groups, the number of predictor variables 
(g), results of the Box test for homogeneity of covariance structures, and a 
comparison of the leave-one-out performance of the PDA and LR models for each 
group separately and for the total sample. We compared the performance of the two 
classification models, displayed as the hit-rate percentage obtained by the g predictor 
variables, via McNemar’s test for contrasting correlated proportions. 
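The index of disproportionality can be computed as in the short Python sketch below. The text defines the numerator as the larger group size times 100; taking the smaller group size as the denominator is an assumption, chosen because it makes equal groups score 100. The function name is hypothetical.

```python
def disproportionality_index(n1, n2):
    """I = 100 * (larger group size) / (smaller group size).

    Equal group sizes give I = 100; larger values mean more unequal groups.
    The smaller group as denominator is an assumption, not stated in the text.
    """
    larger, smaller = max(n1, n2), min(n1, n2)
    return 100.0 * larger / smaller

# Illustrative group sizes (made up).
print(disproportionality_index(50, 50))
print(disproportionality_index(59, 50))
```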

To make an inferential decision using this method, the researcher must, as is 
customary, choose an alpha level; the choice of alpha level results in an associated 
critical z statistic. To illustrate the method for these data sets, we used the .01 alpha 
level with the associated z of 2.58. 
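The link between the alpha level and the critical z can be checked with Python's standard library. This sketch assumes a two-tailed test, which is consistent with the pairing of .01 and 2.58 above; the function names are hypothetical.

```python
from statistics import NormalDist

def critical_z(alpha):
    """Two-tailed critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def differs_significantly(z, alpha=0.01):
    """True when the absolute z statistic exceeds the critical value."""
    return abs(z) > critical_z(alpha)

print(round(critical_z(0.01), 2))
```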

In the first three data sets, maximum likelihood estimates of logistic regression 
parameters could not be calculated due to complete separation of the data (see Table 
1). Among the remaining 29 data sets, differences between the PDA and LR models 
in classifying the total sample were not statistically significant, with the exception of 
Data Set 26 (see Table 1). Here, the LR model yielded a significantly higher total hit 
rate. 



Insert Table 1 about here 



Differences between the two classification models in separate-group hit rates 
were statistically significant in nine of the 32 data sets (9, 16, 17, 23, 24, 26, 30, 31, 
and 32). In eight of these data sets, superior performance of the LR model in 
classifying the larger group was offset by superior performance of the PDA model in 
classifying the smaller group; the only exception was Data Set 31, in which the 
significant advantage of the LR model for the larger group was offset by a 
nonsignificant advantage of the PDA model for the smaller group. 

Statistically significant differences in separate-group hit rates were found in 
data sets with moderate to relatively large discrepancies in sample sizes and small to 
moderate group separation, but not in any data set with similar sample sizes (I < 
118) or with relatively large group separation (D > 2.0). This pattern 
suggests that CTGLM uses sample sizes as estimates of population sizes when 
generating maximum likelihood estimates of LR model parameters. Using sample 
sizes as estimates of population sizes is inappropriate, however, when population sizes 
are unknown or when sample sizes are not proportional to population sizes (Huberty, 
1994, p. 65). Consequently, we decided to force the assumption of equal population 
sizes by adjusting the LR model by a constant. We determined the value of the 
constant by referring to the FORTRAN program LOGDIS (Albert & Harris, 1987), 
which uses an iterative Newton-Raphson procedure to obtain maximum likelihood 
estimates of LR model parameters for the k-group classification problem. Under the 
assumption of equal population sizes, there were no statistically significant differences 
in total-sample or separate-group hit rates between LR and PDA for any of the 29 
data sets for which maximum likelihood estimates of LR model parameters could be 
calculated (results available on request). 
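One way such a constant adjustment could look is sketched below in Python. The source does not state the value of the constant it took from LOGDIS, so the use of the log ratio of the group sample sizes, ln(n1 / n0), is an assumption based on the usual prior-correction argument: an LR intercept fitted to a sample absorbs the sample log-odds of group membership, and removing that term corresponds to equal priors. All names and numbers are hypothetical.

```python
import math

def equal_prior_intercept(b0, n1, n0):
    """Remove the sample log-odds ln(n1 / n0) from a fitted LR intercept.

    b0 is the intercept estimated from a sample with n1 cases in group 1
    and n0 cases in group 0. The offset is an assumed form of the constant.
    """
    return b0 - math.log(n1 / n0)

def classify_equal_priors(b0, coefs, x, n1, n0):
    """Assign group 1 when the adjusted linear predictor is positive."""
    eta = equal_prior_intercept(b0, n1, n0) + sum(b * v for b, v in zip(coefs, x))
    return 1 if eta > 0 else 0

# Illustrative values (made up): with 90 versus 30 cases, a borderline case
# that the unadjusted model would place in the larger group is reassigned.
print(classify_equal_priors(1.0, [0.0], [0.0], n1=90, n0=30))
```

With equal group sizes the offset is zero and the classification is unchanged, which matches the observation that significant separate-group differences appeared only in data sets with unequal group sizes.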

From these data sets, which were selected to 
portray a wider range of characteristics than were previously available, some 
tentative conclusions can be drawn. For total-group accuracy, hit rates for the LR and PDA 
models did not differ significantly in 28 of the 29 data sets for which LR model estimates could 
be calculated. Neither theoretical nor data-based considerations were helpful in 
predicting which of the two models would work better. 

In reference to separate-group accuracy the results are a bit more complicated. 
For separate-group hit rate to be of interest the researcher must make an a priori 
decision that accuracy in one group is more important than in the other. For 
example, the researcher may decide that, in predicting high school dropouts, to be 
correct for the dropouts is more important than for a persister group. These data 
suggest that when the size of one population is much larger than the other, the 
researcher may improve separate-group hit rate by choosing the LR model if interest 
is in classification accuracy of the larger group, and by choosing the PDA model if 
interest is in classification accuracy of the smaller group. 

It certainly may be that there are other data sets in which LR or PDA would 
be judged significantly superior for total-group as well as separate-group hit rates; 
from the results of these analyses, it seems quite possible that there may be data that 
manifest a PDA rule that is significantly superior for one group and an LR rule that is 
significantly superior for the other. In this case, if the researcher is interested in 
separate-group accuracy, knowledge of these results would allow selection of a rule 
depending on which group has the highest priority. Use of the method and computer 
program demonstrated herein will allow such decisions to be made based on explicit 
cross-validated classification accuracies. 



Note 

For a copy of the FORTRAN program that accomplishes the method, send a 
returnable diskette and diskette mailer to Alice Meshbane, College of Education, 
Florida Atlantic University, P.O. Box 3091, Boca Raton, FL 33431-0991. Internet: 
Meshbane@acc.fau.edu. 

Table 1 
Data Set Description, Results of Box M Test for Equality of Covariance Matrices, and Comparison of Hit Rate Percents 